Complete Chloroplast Genomes of Pterodon emarginatus Vogel and Pterodon pubescens Benth: Comparative and Phylogenetic Analyses

Background The species Pterodon emarginatus and P. pubescens, popularly known as white sucupira or faveira, are native to the Cerrado biome and have the potential for medicinal use and reforestation. They are sister species with evolutionary proximity. Objective Considering that the chloroplast genome exhibits a conserved structure and genes, the analysis of its sequences can contribute to the understanding of evolutionary, phylogenetic, and diversity issues. Methods The chloroplast genomes of P. emarginatus and P. pubescens were sequenced on the Illumina MiSeq platform. The genomes were assembled based on the de novo strategy. We performed the annotation of the genes and the repetitive regions of the genomes. The nucleotide diversity and phylogenetic relationships were analyzed using the gene sequences of these species and others of the Leguminosae family, whose genomes are available in databases. Results The complete chloroplast genome of P. emarginatus is 159,877 bp, and that of P. pubescens is 159,873 bp. The genomes of both species have circular and quadripartite structures. A total of 127 genes were predicted in both species, including 110 single-copy genes and 17 duplicated genes in the inverted regions. 141 microsatellite regions were identified in P. emarginatus and 140 in P. pubescens. The nucleotide diversity estimates of the gene regions in twenty-one species of the Leguminosae family were 0.062 in LSC, 0.086 in SSC, and 0.036 in IR. The phylogenetic analysis demonstrated the proximity between the genera Pterodon and Dipteryx, both from the clade Dipterygeae. Ten pairs of primers with potential for the development of molecular markers were designed. Conclusion The genetic information obtained on the chloroplast genomes of P. emarginatus and P. pubescens presented here reinforces the similarity and evolutionary proximity between these species, with a similarity percentage of 99.8%.


INTRODUCTION
The chloroplast (cp) is a semi-independent organelle that is closely and constantly related to the nucleus, since most of its genes were lost or transferred to the nucleus throughout evolution [1,2]. The cp genome usually has a circular structure divided into four parts: a small single-copy region (SSC), a large single-copy region (LSC), and two inverted repeats (IRa and IRb) [3][4][5]. It varies among species in size, genes, and indels, as well as in the lengths of intergenic regions and introns. Compared with the nuclear and mitochondrial genomes, the cp genome has fewer repetitive sequences, a smaller size, low nucleotide substitution rates, uniparental inheritance, and a relatively more conserved structure [1,2].
Genomic studies reveal the architecture of chloroplast genomes, and this information contributes to understanding their evolution in plants. Chloroplast genome sequencing data have been used in several evolutionary, phylogenetic, and diversity studies [1,9]. Chloroplast DNA sequences have also been used to investigate similarities and differences between the genomic structures of phylogenetically close species, as in the works on Dipteryx alata [10], Rambutan [11], and Prunus [12], in addition to phylogenomic studies [13]. Thus, complete chloroplast sequences are effective tools for understanding the evolution and genetic diversity of species, and the results help clarify the evolutionary history of species and genera [13].
Pterodon emarginatus Vogel and Pterodon pubescens (Benth.) Benth. are sister species belonging to the family Leguminosae, subfamily Papilionoideae, and tribe Dipterygeae [14][15][16]. Both are popularly known as faveira or white sucupira and are widely distributed in Cerrado areas in Brazil. The literature points to a recent process of separation involving P. emarginatus, which has possibly undergone some diversification, given its lower number and frequency of haplotypes as well as lower rates of diversity. Because P. pubescens and P. emarginatus are continuously distributed geographically, without barriers between them, this recent separation may be a consequence of sympatric or parapatric speciation [17]. Both species are very important as genetic resources.
The main current and potential uses of these species are medicinal, with scientific studies demonstrating several pharmacological properties of their seed oil, including analgesic, anti-inflammatory, antitumor, antioxidant, angiogenic, and antimicrobial properties. The species are also useful in the recovery of degraded areas [18][19][20][21][22][23]. Studies on the chloroplast genomes of the sucupira-branca help clarify the evolutionary relationships of these species.
Considering the evolutionary proximity between P. pubescens and P. emarginatus and their valuable genetic resources, which should be better defined, we sequenced and assembled the complete chloroplast genomes of the two species to determine the levels of conservation and diversity of these genomes using comparative genomics methods. Our objectives were 1) to assess the general structure of the genomes and the diversity of chloroplast genes in P. pubescens and P. emarginatus in comparison with other species of the Leguminosae family; 2) to identify microsatellite regions with potential for the development of specific chloroplast microsatellite markers for P. pubescens and P. emarginatus; and 3) to determine the phylogenetic relationships between these Pterodon species based on chloroplast gene sequences.
This study is supported by the hypothesis that the chloroplast genomes of these two species of the genus Pterodon show a high percentage of similarity.

Obtaining P. pubescens and P. emarginatus DNA Samples and Preparing the Libraries and DNA Sequencing
We collected samples of young leaves from P. pubescens in a region of the Cerrado biome (Voucher: 61013) in the city of Goiânia, GO, Brazil (latitude: -16.577772; longitude: -49.273777). Total DNA was isolated following the extraction protocol proposed by Doyle and Doyle [24]. The sequencing library was prepared using the Nextera DNA Flex Library Prep Kit (Illumina). For P. emarginatus, young leaves were collected from an adult individual, also from the Cerrado region, in Planaltina, DF, Brazil (latitude: -1,600,000; longitude: -47,658,000, Voucher 68411). Chloroplast DNA was extracted using the Chloroplast Isolation Kit (Sigma-Aldrich). DNA libraries were prepared following the protocol of the Agilent Technologies SureSelectQXT kit. The P. pubescens and P. emarginatus DNA libraries were sequenced in two independent runs, one for each library, both on the Illumina MiSeq platform with the v3 600-cycle kit (2x300 bp, paired-end reads).

Assembly of the Chloroplast Genome
The quality of the reads was evaluated using the FastQC software [25], and sequencing adapters and bases with a Phred value <20 were removed using the Trimmomatic software [26]. The P. pubescens reads were mapped and filtered from the total genomic DNA by comparing them with chloroplast DNA sequences of other plant species from the NCBI GenBank database. For this filtering, we created a database containing the chloroplast sequences available in NCBI RefSeq for all angiosperm species until June 2020. The reads were aligned to this database using the Bowtie2 software [27], and all mapped reads were separated for the subsequent assembly of the P. pubescens chloroplast sequences. The same methodology was applied to the P. emarginatus reads to verify whether the Chloroplast Isolation Kit extracted only chloroplast DNA.
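The Phred <20 criterion can be illustrated with a minimal Python sketch. It is not Trimmomatic's implementation, and the text does not state which trimming steps were used; the sketch assumes Phred+33 quality encoding and trims only the low-quality bases at the two ends of a read. The function names are hypothetical.

```python
def phred_scores(quality, offset=33):
    """Decode a FASTQ quality string into integer Phred scores.

    The offset of 33 (Phred+33 encoding) is an assumption.
    """
    return [ord(ch) - offset for ch in quality]


def trim_low_quality_ends(seq, quality, min_phred=20, offset=33):
    """Drop leading and trailing bases whose Phred score is below min_phred."""
    scores = phred_scores(quality, offset)
    start, end = 0, len(scores)
    # Advance from the 5' end while bases are below the threshold.
    while start < end and scores[start] < min_phred:
        start += 1
    # Retreat from the 3' end while bases are below the threshold.
    while end > start and scores[end - 1] < min_phred:
        end -= 1
    return seq[start:end], quality[start:end]
```

In this encoding, `#` corresponds to Phred 2, `5` to Phred 20, and `I` to Phred 40, so a read with qualities `#5IIII5#` loses only its first and last base.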
The chloroplast genomes were assembled de novo using the Fast-Plast pipeline, which relies on SPAdes 3.11.1, an assembler based on de Bruijn graphs [28].
Microsatellite regions in the chloroplast sequences were identified using the MISA program (MIcroSAtellite identification tool) [34], with the following parameters: a minimum of ten tandem repeats for mononucleotides, five for dinucleotides, four for trinucleotides, and three for pentanucleotides and hexanucleotides. We also used RepeatMasker to identify repetitive elements, low-complexity sequences, and interspersed repeats.
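The repeat thresholds above can be sketched as a small Python scanner for perfect tandem repeats. This is only an illustration of the criteria, not MISA's implementation. The tetranucleotide threshold is not stated in the text, so the value of three used below is an assumption, and the function names are hypothetical.

```python
# Minimum number of tandem copies per motif length. The values for lengths
# 1, 2, 3, 5 and 6 come from the text; the value for length 4 is an
# assumption because the text does not state it.
MIN_REPEATS = {1: 10, 2: 5, 3: 4, 4: 3, 5: 3, 6: 3}


def is_primitive(motif):
    """Return False for motifs such as 'AA' or 'ATAT' built from a shorter unit."""
    return motif not in (motif + motif)[1:-1]


def find_ssrs(seq, min_repeats=MIN_REPEATS):
    """Return sorted (start, motif, copies) tuples; start is 1-based."""
    seq = seq.upper()
    hits = []
    for size, n_min in sorted(min_repeats.items()):
        i = 0
        while i + size <= len(seq):
            motif = seq[i:i + size]
            # Skip motifs with ambiguous bases or made of a shorter unit.
            if set(motif) - set("ACGT") or not is_primitive(motif):
                i += 1
                continue
            # Count consecutive copies of the motif starting at position i.
            copies = 1
            while seq[i + copies * size:i + (copies + 1) * size] == motif:
                copies += 1
            if copies >= n_min:
                hits.append((i + 1, motif, copies))
                i += copies * size  # continue after the reported repeat
            else:
                i += 1
    return sorted(hits)
```

Rejecting non-primitive motifs prevents a run of ten A bases from also being reported as a dinucleotide or pentanucleotide repeat.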
The location and size of the repeats (forward, reverse, complementary, and palindromic) were identified using the REPuter software. The primers were designed with the Primer3 software [35]. The following parameters guided the design of primers for future amplification of the chloroplast microsatellite regions: amplification product size of 150-400 bp, GC content of 30-60%, melting temperature (Tm) of 56-62 °C, and primer size of 18-27 bp. Mononucleotide microsatellite regions and dinucleotide regions with an AT motif were avoided because, despite being frequent in the genome, they can form hairpin-shaped structures that reduce the efficiency of PCR amplification.
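The design criteria listed above can be expressed as a simple filter, sketched here in Python. It does not reproduce Primer3: the melting temperature is assumed to be supplied by the design software rather than computed here, and the function names and the example primer are hypothetical.

```python
def gc_percent(primer):
    """Percentage of G and C bases in a primer sequence."""
    primer = primer.upper()
    return 100.0 * (primer.count("G") + primer.count("C")) / len(primer)


def passes_design_criteria(primer, tm_celsius, product_size):
    """Check a primer against the ranges stated in the text.

    tm_celsius is assumed to come from the primer design software.
    """
    return (
        18 <= len(primer) <= 27                # primer size in bases
        and 30.0 <= gc_percent(primer) <= 60.0  # GC content in percent
        and 56.0 <= tm_celsius <= 62.0          # melting temperature in °C
        and 150 <= product_size <= 400          # amplicon size in bp
    )
```

For example, a hypothetical 20-base primer with 50% GC, a Tm of 59 °C, and a 250 bp product passes, whereas the same primer with a 500 bp product does not.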

Comparative Analyses of Chloroplast Genomes
The complete chloroplast genomes of P. pubescens and P. emarginatus were compared with each other and with four other species (Pterodon abruptus, D. alata, Styphnolobium japonicum, and Lupinus luteus) using the mVISTA software. The P. emarginatus cp genome was used as the reference, and the analysis followed the Shuffle-LAGAN method [36]. After the mVISTA analysis, all sequences were compared with BLAST, using P. emarginatus as the reference, to determine the percentage of similarity between the genomes [37].
The nucleotide diversity analysis was performed using the gene sequences of P. pubescens (ON360975) and P. emarginatus (ON360974) and twenty-one other legume species available in the NCBI database: ADA clade (Pterodon abruptus, MW628952.1; Dipteryx alata, MT119080.Phaseolus vulgaris, NC_009259.1). These species were also used in the phylogenetic analysis, along with two species from other families as outgroups: Cucumis sativus (DQ119058), belonging to the Cucurbitaceae family, and Fragaria vesca (JF345175), of the Rosaceae family.
The gene sequences were aligned using the ClustalW software [38], and the genes were concatenated using the MEGA X software [39]. The nucleotide diversity of the chloroplast gene sequences was estimated using the DnaSP6 software [40], with each gene analyzed individually.
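As an illustration of the quantity being estimated, nucleotide diversity can be sketched as the average proportion of differing sites over all pairs of aligned sequences. The Python function below is a minimal sketch, not DnaSP6's implementation; it assumes that alignment columns containing gaps or ambiguous bases are excluded, and the function name is hypothetical.

```python
from itertools import combinations


def nucleotide_diversity(aligned):
    """Average pairwise proportion of differing sites in an alignment.

    Columns with gaps or ambiguous bases in any sequence are skipped;
    this treatment of gaps is an assumption of the sketch.
    """
    if len(aligned) < 2:
        raise ValueError("at least two sequences are required")
    if len({len(s) for s in aligned}) != 1:
        raise ValueError("sequences must have the same aligned length")
    seqs = [s.upper() for s in aligned]
    # Keep only columns in which every sequence has an unambiguous base.
    columns = [c for c in zip(*seqs) if all(b in "ACGT" for b in c)]
    if not columns:
        raise ValueError("no comparable sites")
    pairs = list(combinations(range(len(seqs)), 2))
    differences = sum(
        sum(1 for col in columns if col[i] != col[j]) for i, j in pairs
    )
    return differences / (len(pairs) * len(columns))
```

With three aligned sequences of four sites in which one sequence differs from the other two at a single site, two of the three pairs differ at one site each, giving 2 / (3 × 4) ≈ 0.167.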
The concatenated alignment of chloroplast genes was used for the phylogenetic analysis, performed using the MEGA X software [39]. The dendrogram was constructed using the maximum likelihood method based on the Tamura-Nei model. Node support was assessed using the bootstrap method with 1000 replicates. The dendrogram was edited using the FigTree software v. 1.4.3 (http://tree.bio.ed.ac.uk/software/figtree/).
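The resampling step behind bootstrap support can be sketched in Python as follows. This shows only how pseudo-replicate alignments are drawn by sampling alignment columns with replacement; the tree inference that the phylogenetic software performs on each replicate is not shown. The function names and the fixed seed are assumptions made for illustration.

```python
import random


def bootstrap_alignment(aligned, rng):
    """Resample alignment columns with replacement, keeping the same length."""
    n_sites = len(aligned[0])
    picks = [rng.randrange(n_sites) for _ in range(n_sites)]
    # Apply the same column choices to every sequence so columns stay intact.
    return ["".join(seq[i] for i in picks) for seq in aligned]


def bootstrap_replicates(aligned, n_replicates=1000, seed=0):
    """Generate pseudo-replicate alignments; the seed is only for repeatability."""
    rng = random.Random(seed)
    return [bootstrap_alignment(aligned, rng) for _ in range(n_replicates)]
```

The support of a node is then the fraction of replicate trees in which that grouping is recovered.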

RESULTS
The chloroplast genomes of P. pubescens and P. emarginatus have very similar sizes, with that of P. emarginatus being slightly larger (159,877 bp) than that of P. pubescens (159,873 bp), a difference of only 4 bp. Both chloroplast genomes have a quadripartite structure composed of two inverted regions (IR), with 25,638 bp in P. emarginatus and 25,119 bp in P. pubescens, separated by a large single-copy region (LSC), with 89,177 bp in P. pubescens and 89,174 bp in P. emarginatus, and a small single-copy region (SSC), with 20,458 bp and 19,427 bp for P. pubescens and P. emarginatus, respectively (Table 1).
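As a consistency check on these figures, the total length of a quadripartite genome should equal LSC + SSC + 2 × IR. The short Python sketch below applies this to the region lengths reported above; the dictionary layout and function name are only illustrative.

```python
# Region lengths in bp, as reported in the text.
REGIONS = {
    "P. emarginatus": {"LSC": 89174, "SSC": 19427, "IR": 25638},
    "P. pubescens": {"LSC": 89177, "SSC": 20458, "IR": 25119},
}


def genome_length(regions):
    """Total length of a quadripartite genome: LSC + SSC + two IR copies."""
    return regions["LSC"] + regions["SSC"] + 2 * regions["IR"]


for species, regions in REGIONS.items():
    print(species, genome_length(regions))
```

The sums are 159,877 bp for P. emarginatus and 159,873 bp for P. pubescens, matching the reported genome sizes.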
The GC content was 35% in the total chloroplast genome, 42% in the IR regions, and 32% and 29% in the LSC and SSC regions, respectively (Table 1). Fig. (1) highlights the position and direction of the genes distributed in these four regions.

Gene Annotation and Microsatellite Regions Identified in the Chloroplast Genomes of P. pubescens and P. emarginatus
In total, 127 genes were predicted in the chloroplasts of both species, of which 110 are single-copy and 17 are duplicated. The predicted genes are involved in several metabolic pathways, such as photosynthesis, self-replication, and biosynthesis. All genes show high similarity in nucleotide sequence and position between the chloroplast genomes of P. emarginatus and P. pubescens. The SSC region contains 13 genes, of which only one is a tRNA (trnL-UAG) and the remainder are protein-coding genes, whereas the LSC region has a total of 77 different genes, including 21 tRNAs (Table 2).
Table 4 shows the microsatellite repeats grouped by type of repeat motif. Overall, both species have similar and conserved chloroplast microsatellite regions, with 141 regions in P. emarginatus and 140 in P. pubescens. Mononucleotides are the most common type of chloroplast microsatellite in both species, and approximately 75% of them are repeats of base A. Both species presented the same number of tetra-, penta-, and hexanucleotide microsatellites.
According to the RepeatMasker software, no satellite-type repeats or transposable elements were identified. In turn, the REPuter software analyzed the 50 best results and found the same values for both species in the IR region for forward, reverse, palindromic, and complementary repeats, corroborating studies that report that the inverted repeat regions are more conserved. The LSC and SSC regions showed similar values, as shown in Table 5.
Based on the identified microsatellite regions and established criteria, 10 pairs of primers were designed to amplify the chloroplast microsatellite regions in Pterodon species, five for each species (Table 6).

Comparison of the Chloroplast Genomes of Pterodon emarginatus and P. pubescens with Other Species of Leguminosae
The coding regions showed the following values of nucleotide diversity: 0.062 for the LSC, 0.086 for the SSC, and 0.036 for the IR region (Fig. 2); therefore, the latter had the lowest diversity. The genes with the highest diversity peaks were trnG-UCC, trnK-UUU, and ycf4, present in the LSC region, with values of 0.23, 0.22, and 0.16, respectively. In the SSC region, the diversity was 0.15 for rpl32 and 0.2 for ycf1, which is partly in the IR region.
The rps16 gene is absent in both annotated Pterodon species, as well as in P. abruptus; however, it is present in the other four species of the Dipterygeae. The rps19 gene was identified incompletely in P. emarginatus and P. pubescens but was not annotated in D. alata. The non-coding regions showed greater divergence than the coding ones, and the LSC and SSC regions were more divergent than the IRs. The alignment also revealed differences in the order of the annotated genes. The genome of L. luteus, for example, presents inversion regions that were not identified in Pterodon. The analysis using the mVISTA software (Fig. 3) compared the complete genomes, highlighting in blue the gene regions, which are more conserved in the chloroplast genomes. In P. emarginatus and P. pubescens, all genes are the same and in the same position. However, not all genes are common to all species. For example, the psbB, trnL-UAG, and trnQ-UUG genes appear in the other four species but were not found in D. alata. The similarity between the two species studied, according to a comparative analysis performed with BLAST, was 99.8%. In comparison with P. abruptus, D. alata, S. japonicum, and L. luteus, it was 99.5%, 97.9%, 93.6%, and 95.9%, respectively.
The assembled sequences for Pterodon allowed us to reconstruct the expected tree topology for legumes. All nodes had moderate to high support. The dendrogram (Fig. 4) shows the phylogenetic relationships and the proximity between the ADA and Cladrastis clades. The Pterodon species and D. alata were grouped, which was expected since they belong to the Dipterygeae tribe. Amburana was grouped with Dussia, both from the Amburaneae tribe. All species of the ADA clade were also grouped, including Angylocalyx braunii of the Angylocalyceae tribe.

DISCUSSION
In this study, the chloroplast genomes of Pterodon emarginatus and P. pubescens showed a quadripartite structure, typical of angiosperms, with a large and a small single-copy region separated by two inverted regions [3,5]. The genomes of the studied species, P. emarginatus (159,877 bp) and P. pubescens (159,873 bp), had similar sizes. Additionally, their sizes are close to those of the chloroplast genomes of other Leguminosae species, such as D. alata, with 158.6 kb [10], and L. luteus, with 151.8 kb [41].
We found 127 genes in the two Pterodon species, only two fewer than the number of genes in the chloroplast genome of D. alata [10]. The total number of genes in chloroplast genomes is approximately 130, as observed, for example, in the species Cedrela odorata (MG724915), Khaya senegalensis (KX364458.1), and Carapa guianensis (MF401522.1) [42]. The number of genes varies between species due to losses that occurred throughout the evolutionary process, duplication events, or transfers between organellar and nuclear genomes [2,10,43].
The rps19 gene, which encodes a protein of the small ribosomal subunit, is incomplete in both Pterodon genomes. It was also not identified in D. alata, suggesting that the gene was partially lost during the evolution of the ADA clade [10,44]. The ycf1 gene lies at the border between the SSC and IR regions, and a 504 bp stretch is repeated at the edges of the IR regions, similar to P. vulgaris, in which this stretch of the gene has 505 bp [44]; in G. max, the repeat has 478 bp [45]. The ycf1 gene presented a high diversity value herein, the greatest variation among all genes analyzed in the inverted regions, which increased the diversity of that region.
The genes with the greatest diversity were ycf1, trnG-UCC, trnK-UUU, ycf4, rpl32, and clpP. When studying Cercis chuniana, of the Fabaceae family, [46] found the greatest diversities in the trnT-trnL, psbZ-trnG, rpl32, rps3-rps19, and ycf1 gene regions, with values above 0.12. The causes and consequences of different evolutionary rates among protein-coding genes deserve attention. Such differences can be explained by disparities in generation time, relaxed selection, gene expression level, and gene function [47].
The rps16 gene is present in the genomes of other species of the ADA clade with sequences in databases; however, it is absent in Pterodon species, as well as in other legume species such as C. arietinum [48], P. vulgaris [44], and the genus Lupinus [49]. In Vigna radiata, it is probably not functional as it contains three internal stop codons and its start codon is AGA [50].
The rps16 gene, which encodes the ribosomal protein S16, is present in the chloroplast genome of most plants. However, it has been lost multiple times, which is attributed to the dual targeting of the nuclear rps16 copy to the plastid and the mitochondria. This suggests that the chloroplast-encoded rps16 was silenced and became a pseudogene, replaced by the nuclear-encoded rps16. Confirming these events requires more rigorous sequencing and PCR strategies [8,48].
The high similarity between the two species according to BLAST, together with the comparison with P. abruptus, D. alata, S. japonicum, and L. luteus, suggests a phylogenetic proximity between the ADA and Cladrastis clades [10]. The dendrogram supports the same finding. In general, chloroplast gene sequences are well conserved, especially between phylogenetically close species, which is consistent with the high similarity between the chloroplast genes of both studied species.
As for the repetitive regions of the genome, [10] identified 131 microsatellites in the chloroplast genome of D. alata, and [8] found 137 microsatellites in S. adstringens. These values are close to those identified in the species studied herein and are considered relatively low. Repetitive elements can contribute to genetic diversity, so such a low number of microsatellites indicates great conservation of the chloroplast genomes [45,51].

CONCLUSION
The large-scale data produced by genomic sequencing have extended the study of plastids [52]. The plastome plays a significant role in the speciation process by influencing a variety of ontogenetic characteristics [53]. Genomic studies promote significant advances in phylogenetic analyses. In the case of P. pubescens and P. emarginatus, the differences between the cp genomes are small: although they are distinct species, they are similar enough to form hybrids in natural environments [54]. Thus, this study opens the way for research involving information on the chloroplasts of species of the genus Pterodon. Additionally, we provide a deeper phylogenetic investigation of the ADA clade and contribute to conservation actions by making primers available for studies with molecular markers.

HUMAN AND ANIMAL RIGHTS
No animals/humans were used for studies that are the basis of this research.

Fig. (2). Comparison of nucleotide diversity in the large single-copy, small single-copy, and inverted regions among twenty-one species of the Leguminosae family. (A higher resolution / colour version of this figure is available in the electronic copy of the article).

Fig. (3). Comparison of Leguminosae chloroplast genomes using mVISTA. The chloroplast genome of Pterodon emarginatus was used as the reference, followed by P. pubescens, P. abruptus, D. alata, S. japonicum, and L. luteus. Blue blocks represent conserved genes, and red blocks indicate conserved non-coding sequences (CNS). The Y-axis indicates the percentage of identity, which varies between 50 and 100%. Sequence variations of the 5 species are shown as white regions. The orientation of the genes is shown by the gray arrows above the alignment.

Fig. (4). Dendrogram constructed using the maximum likelihood method with 1000 replicates for bootstrap support, highlighting the ADA clade.